Introduction

Regression problem - NYC Taxi Fare Prediction

The main goal of this paper is to apply ML methods …

library(caret)
library(tidymodels)
library(lightgbm)
library(parallel)
library(doParallel)
library(yaml)
library(readr)
library(tidyverse)
library(bonsai)
library(ranger)
library(kernlab)
library(plotfunctions)
library(DT)
library(geosphere)

Reading the data and the model specification from the config.yaml file

source('functions.R')
config <- yaml.load_file("config.yaml")
df <- read_csv(config$dataset$raw)
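As an illustration of how the nested lookup `config$dataset$raw` resolves, here is a minimal sketch that writes a tiny YAML file and reads it back. Only the `dataset` / `raw` keys come from the code above; the path `data/train.csv` and the temporary file are assumptions made for the example, and the real config.yaml presumably also holds model settings that are not shown here.

```r
library(yaml)

# Hypothetical config contents: only the dataset$raw key is taken from the
# report; the path value below is invented for illustration.
cfg_path <- tempfile(fileext = ".yaml")
writeLines(c(
  "dataset:",
  "  raw: data/train.csv"
), cfg_path)

# yaml.load_file() returns a nested list, so keys are reached with `$`.
config <- yaml.load_file(cfg_path)
raw_path <- config$dataset$raw
print(raw_path)
```

Keeping paths and model settings in one YAML file means the analysis script does not need to change when the data location or hyperparameters do.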

Data Preparation

New columns creation

Let’s check whether

## # A tibble: 0 × 21
## # ℹ 21 variables: id <dbl>, dropoff_latitude <dbl>,
## #   dropoff_longitude <dbl>, fare_amount <dbl>,
## #   feat01 <dbl>, feat02 <dbl>, feat03 <dbl>, feat04 <dbl>,
## #   feat05 <dbl>, feat06 <dbl>, feat07 <dbl>, feat08 <dbl>,
## #   feat09 <dbl>, feat10 <dbl>, key <dttm>,
## #   passenger_count <dbl>, pickup_datetime <dttm>,
## #   pickup_latitude <dbl>, pickup_longitude <dbl>, …
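The later summary table lists `pickup_date` and `pickup_time` columns in place of `pickup_datetime`, so the new columns were presumably derived from the timestamp. The original code is not shown; the sketch below is one possible way to do it, assuming a POSIXct `pickup_datetime` in UTC and using a toy data frame named `trips` (a hypothetical name).

```r
library(dplyr)

# Toy stand-in for the taxi data; only pickup_datetime matters here.
trips <- data.frame(
  pickup_datetime = as.POSIXct(
    c("2014-03-01 08:15:00", "2014-03-01 23:59:30"),
    tz = "UTC"
  )
)

# Split the timestamp into a calendar date and a time-of-day string.
# The time zone is passed explicitly so the split does not depend on the
# local settings of the machine running the code.
trips <- trips %>%
  mutate(
    pickup_date = as.Date(pickup_datetime, tz = "UTC"),
    pickup_time = format(pickup_datetime, "%H:%M:%S", tz = "UTC")
  )

print(trips)
```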

Let’s see how many NA and zero values each column of the data table contains.

##                        column_names na_count zero_count na_percentage zero_percentage
## id                               id        0          0             0     0.000000000
## dropoff_latitude   dropoff_latitude        0       1667             0     1.852222222
## dropoff_longitude dropoff_longitude        0       1667             0     1.852222222
## fare_amount             fare_amount        0          0             0     0.000000000
## feat01                       feat01        0          1             0     0.001111111
## feat02                       feat02        0          0             0     0.000000000
## feat03                       feat03        0          0             0     0.000000000
## feat04                       feat04        0          0             0     0.000000000
## feat05                       feat05        0          0             0     0.000000000
## feat06                       feat06        0          0             0     0.000000000
## feat07                       feat07        0          0             0     0.000000000
## feat08                       feat08        0          1             0     0.001111111
## feat09                       feat09        0          0             0     0.000000000
## feat10                       feat10        0          0             0     0.000000000
## passenger_count     passenger_count        0         21             0     0.023333333
## pickup_latitude     pickup_latitude        0       1652             0     1.835555556
## pickup_longitude   pickup_longitude        0       1652             0     1.835555556
## pickup_date             pickup_date        0          0             0     0.000000000
## pickup_time             pickup_time        0          0             0     0.000000000
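A summary like the one above could be produced by a small helper along these lines. This is a sketch, not the function from functions.R (which is not shown): the name `count_na_zero` is hypothetical, and zeros are counted only in numeric columns, which is an assumption.

```r
# Hypothetical helper: per-column NA and zero counts plus percentages.
count_na_zero <- function(df) {
  n <- nrow(df)
  na_count <- sapply(df, function(x) sum(is.na(x)))
  # Count zeros only where the comparison is meaningful (numeric columns).
  zero_count <- sapply(df, function(x) {
    if (is.numeric(x)) sum(x == 0, na.rm = TRUE) else 0L
  })
  data.frame(
    column_names    = names(df),
    na_count        = na_count,
    zero_count      = zero_count,
    na_percentage   = 100 * na_count / n,
    zero_percentage = 100 * zero_count / n
  )
}

# Usage on a toy data frame.
toy <- data.frame(a = c(0, 1, NA, 0), b = c("x", "y", "z", "w"))
res <- count_na_zero(toy)
print(res)
```

Percentages are expressed on a 0 to 100 scale, which matches the table: 1667 zeros correspond to about 1.85 % of the rows.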

Because all of the non-meaningful zero values are in the longitude/latitude columns, we decided to add a binary indicator that flags whether a trip has meaningful geospatial values.
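One way to build such an indicator is sketched below. The column name `has_valid_coords`, the toy data frame `trips`, and the rule that all four coordinates must be non-zero are assumptions; the original implementation is not shown.

```r
library(dplyr)

# Toy rows: the first has complete coordinates, the other two contain
# zero placeholders in the pickup or dropoff position.
trips <- data.frame(
  pickup_latitude   = c(40.75, 0, 40.71),
  pickup_longitude  = c(-73.99, 0, -74.00),
  dropoff_latitude  = c(40.76, 40.70, 0),
  dropoff_longitude = c(-73.98, -73.95, 0)
)

# 1 when every coordinate is non-zero, 0 otherwise.
trips <- trips %>%
  mutate(
    has_valid_coords = as.integer(
      pickup_latitude != 0 & pickup_longitude != 0 &
        dropoff_latitude != 0 & dropoff_longitude != 0
    )
  )

print(trips)
```

Keeping the rows and flagging them, rather than dropping them, lets a model still use the non-spatial features of those trips.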

Visualization